Abstract

This research addressed the USAF's unprecedented proactive persistent surveillance Long Term
Challenge. Specifically, we aimed at a substantial enhancement of the ability to conduct autonomous,
video-based, persistent intelligent surveillance, reconnaissance and threat assessment in highly uncertain,
adversarial scenarios such as urban environments. At its core was a novel approach that stresses dynamic
models as key enablers for finding, tracking, and anticipating and assessing the behavior of multiple
targets, using as inputs data streams from spatially distributed sensors. The work included both theory
developments in an emerging field (dynamics-based extraction of information sparsely encoded in
high-dimensional data) and an investigation of implementation issues.

Motivation 

Controlled dynamic vision, the confluence of computer vision and control, is well positioned
to address the specific needs created by the move of the USAF towards a
“digitized” battlefield. Smart UAVs can carry out intelligence gathering, target tracking and
airspace denial, while minimizing the risk of loss of life. Proactive interfaces can interact better
with  human  operators,  allowing  them  to  concentrate  on  critical  tasks.  In  addition,  the  same 
technology  can  substantially  benefit  the  general  population.  User  aware  environments  can  enable 
an  aging  population  to  carry  on  independent  lives.  Finally,  intelligent  surveillance  systems 
capable  of  detecting  suspicious  activities  will  improve  our  ability  to  prevent  tragedies. 

Several proof-of-concept systems illustrating the ability of dynamic vision to successfully
handle many of the challenges posed by these applications have already been developed.
However, successful autonomous operation of these systems in highly uncertain, unstructured
environments requires developing new mechanisms for extracting, robustly and in a timely manner,
actionable information that is sparsely encoded in extremely large data streams.


Fig. 1: Examples of sparsely encoded visual information. (a) Target tracking in an urban canyon. (b) and (c)
Sample frames showing contextually abnormal events: onset of a tunnel fire and a person entering through
an exit. (d) Tracking multiple targets. In all cases less than O(10^-6) of the data is relevant.

The challenges entailed in this task are illustrated in Fig. 1. In all cases, decisions must be made
based on events discernible only in a small fraction of a very large data record: a short video
sequence adds up to megabytes, yet actionable information (a change of behavior of a single
target) may be encoded in just a few frames, e.g. less than 10^-6 of the total data. Additional
challenges arise from the quality of the data, which is often fragmented and corrupted by noise.

This  research  sought  to  address  these  issues  by  developing  methods  at  the  confluence  of  robust 
dynamical  systems,  information  based  complexity,  machine  learning  and  computer  vision,  laying 
the  foundation  for  a  new  class  of  robust,  autonomous  vision-based  systems. 

Description  of  the  Approach  and  Results  Obtained. 

Below  we  summarize  the  results  obtained  in  the  course  of  this  research.  Technical  details  are 
provided  in  the  cited  publications,  which  can  be  obtained  by  contacting  the  authors  or  from 
http://robustsystems.ece.neu.edu.  This  site  also  contains  presentations  explaining  these  results  in 
detail  and  several  demos. 

Conceptual  foundation:  Dynamic  models  as  information  encoding  and  predictor  paradigms. 
The  basic  premise  underlying  this  research  was  that  relevant  spatio/temporal  information,  at  the 
granularity  levels  required  by  autonomous  vision-based  systems  endowed  with  analysis  and 
decision  making  capabilities,  can  be  compactly  encapsulated  in  dynamic  models,  whose  rank,  a 
measure  of  the  dimension  of  useful  information,  is  often  far  lower  than  the  raw  data  dimension. 
This premise, which amounts to a reasonable “localization” hypothesis for spatio/temporal
correlations, allows each subproblem (tracking, multicamera coordination, dynamic
data interpolation and segmentation, robust decision making) to be reduced to the prototype
system-theoretical problems discussed below. Embedding these problems in the conceptual world of
dynamical systems made available an extremely versatile ensemble of methods that allows for recasting
them into a tractable, finite dimensional convex optimization form that can be efficiently solved.

Basic  Science  Problems.  Application  of  the  ideas  outlined  above  to  the  problems  arising  in  the 
context  of  persistent  surveillance  required  addressing  the  following  basic  science  issues: 

(a) Robust identification of Hammerstein-Wiener systems with high dimensional output
spaces. The problems of interest in this research were characterized by the need to identify
systems whose outputs evolve in extremely high dimensional spaces: tracking target motion and
appearance changes involves considering the evolution of ~O(10^3) pixels, even when using low
resolution images. To address this challenge, we exploited the high degree of correlation
between these pixels to embed the raw data in low dimensional manifolds. Since the projections
to/from these manifolds can be modeled as memoryless nonlinearities, this approach led to the
identification problem shown in Fig. 2. Here Π_i(.) and Π_o(.) are memoryless nonlinearities;
u and y represent the input and the raw data, and u_m and y_m their respective projections onto
the low dimensional manifold; and S is the dynamic model. In the course of this research we have
established that these problems are generically NP-hard [P1] and proposed a polynomial time
relaxation with suboptimality certificates [P2]. For the special case of rational nonlinearities
(corresponding to rational embeddings), we have developed a computationally attractive
relaxation based on recent results on polynomial optimization [P12,P22]. For the case of general
nonlinearities, we have developed a convex optimization based approach for identifying the
embedding manifold and correspondences by reducing the problem to a nuclear norm
minimization subject to local isometric constraints. The resulting algorithm has the ability to
exploit spatio-temporal dynamic constraints (while retaining computational complexity
comparable to the state of the art), thus leading to embeddings that are robust to outliers and
provide the most parsimonious model that explains the data [P19]. Finally, we have taken
preliminary steps towards developing fast algorithms that exploit the intrinsic structure of the
problem, leading to substantial improvement, both in computational time and memory
requirements, over conventional semi-definite programming tools [P30].

(b)  Robust  identification  of  hybrid  systems.  Cases  involving  a  transition  between  different 
models,  e.g.  substantially  different  appearances,  can  be  modeled  as  a  mode-transition  in  a 
piecewise  affine  switched  system.  In  the  case  of  noisy  measurements,  existing  identification 
methods  lead  to  computationally  hard  problems  with  poor  scaling  properties.  We  have  shown 
that  these  difficulties  can  be  circumvented  by  recasting  the  problem  into  a  dynamic  sparsification 
form,  where  one  seeks  the  sparsest  dynamical  model  that  explains  the  data.  Exploiting  recent 
results  from  polynomial  optimization  allows  for  developing  tractable  relaxations  with  optimality 
certificates.  Moreover,  these  relaxations  exploit  the  underlying  structure  of  the  problem  to 
substantially reduce the computational burden [P5,P8,P9,P13,P16,P17,P24,P26,P27,P28].
Finally,  a  feature  that  rendered  the  problems  considered  in  this  research  challenging  is  the  fact 
that  the  data  records  are  often  fragmented  or  corrupted,  due  to  sensor  or  communication  channel 
outages.  As  shown  in  [P21],  this  situation  can  be  handled  by  introducing  an  additional  set  of 
variables,  subject  to  structural  sparsity  constraints  in  the  resulting  optimization  problems. 
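
To make the "sparsest model that explains the data" idea in (b) concrete, one possible way of writing such a problem is sketched below. The notation is an assumption introduced here for illustration and is not taken from the cited publications: r_t denotes a regressor built from past inputs and outputs, p_t the parameter vector active at time t, and \epsilon a bound on the noise. The sketch penalizes the number of parameter changes (switches); the sum-of-norms surrogate shown afterwards is a generic convex relaxation and should not be read as the polynomial optimization relaxations developed in this research.

```latex
% Assumed notation (illustrative only): r_t regressor, p_t parameters at time t,
% \epsilon noise bound, T number of samples.
\begin{aligned}
\min_{\{p_t\}} \quad & \bigl\| \{\, p_{t+1} - p_t \,\}_{t=1}^{T-1} \bigr\|_{0}
  && \text{(number of nonzero differences, i.e. switches)} \\
\text{s.t.} \quad & \lvert y_t - r_t^{\top} p_t \rvert \le \epsilon ,
  && t = 1, \dots, T .
\end{aligned}

% A generic convex surrogate: replace the count by a sum of norms.
\min_{\{p_t\}} \ \sum_{t=1}^{T-1} \lVert p_{t+1} - p_t \rVert_{\infty}
\quad \text{s.t.} \quad \lvert y_t - r_t^{\top} p_t \rvert \le \epsilon ,
\quad t = 1, \dots, T .
```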

(c) Robust model (in)validation of hybrid systems. A crucial step before using the models
identified in (b) above is to check their validity against additional experimental data. A unique
difficulty in validating hybrid models is the fact that the mode signal is typically unmeasurable.
As part of this research we have obtained a necessary and sufficient condition for a switched
affine model to be (in)validated by the experimental data. The starting point is to recast the
(in)validation problem as one of checking whether a semialgebraic set is empty. By using recent
results on sparse polynomial optimization we have shown that this condition is equivalent to
strict positivity of the solution of a related, convex optimization problem [P14,P24].


Fig. 2. Dynamic manifold learning as a Hammerstein-Wiener identification problem.


(d)  Constrained  interpolation  of  noisy  data.  In  most  surveillance  scenarios  only  partial  data  is 
available,  due  for  instance  to  occlusion  or  limited  sensing/transmitting  capabilities.  In  these 
situations,  it  is  of  interest  to  estimate  the  missing  data,  for  instance  in  order  to  perform  data 
association  (e.g.  stitch  tracklets),  or  to  uncover  correlations  mediated  by  the  missing  elements. 
We  have  shown  that  this  interpolation  can  be  reduced  to  a  rank  minimization  problem,  which  in 
turn (due to its Hankel structure) can be efficiently solved using convex relaxations [P7,P11].
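
The link between interpolation and Hankel rank in (d) can be illustrated with a small, noise-free sketch. It is a toy under stated assumptions, not the method of [P7,P11]: the function names (hankel, rank, fill_missing) are hypothetical, the data follow an assumed order-2 recursion, and an exhaustive search over a few integer candidates stands in for the convex relaxation mentioned above. Exact rational arithmetic is used so that the rank is computed without numerical tolerance issues.

```python
from fractions import Fraction


def hankel(seq, rows):
    """Build the Hankel matrix H[i][j] = seq[i + j] with the given row count."""
    cols = len(seq) - rows + 1
    return [[Fraction(seq[i + j]) for j in range(cols)] for i in range(rows)]


def rank(matrix):
    """Exact rank via Gaussian elimination over the rationals."""
    m = [row[:] for row in matrix]
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


def fill_missing(seq, index, candidates, rows):
    """Return (value, rank) for the candidate giving the lowest Hankel rank."""
    best = None
    for v in candidates:
        trial = list(seq)
        trial[index] = v
        k = rank(hankel(trial, rows))
        if best is None or k < best[1]:
            best = (v, k)
    return best


# Toy trajectory of the order-2 recursion y[t] = y[t-1] + y[t-2];
# the sample at index 4 (true value 5) is treated as missing.
observed = [1, 1, 2, 3, None, 8, 13, 21]
value, order = fill_missing(observed, 4, range(0, 11), rows=4)
print(value, order)
```

In this example the only candidate that keeps the Hankel matrix at rank 2 is the true sample, which is the sense in which the lowest-rank completion recovers the missing data. With noisy data or many missing samples an exhaustive search is no longer practical, which is where convex relaxations of the rank become relevant.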

(e) Robust estimation under ℓ∞ bounded disturbances. Traditional noise models often do not
capture  key  features  of  the  problems  of  interest  here.  As  a  simple  example,  noise  in  images 
should  be  bounded.  While  in  principle  this  feature  can  be  captured  using  truncated  distributions, 
the  resulting  problems  are  computationally  hard.  To  circumvent  this  difficulty  we  are  developing 
a  new  framework  for  robust  estimation  in  the  presence  of  unknown-but-bounded  noise.  Using  a 
concept  similar  to  superstability  leads  to  robust  filters  that  can  be  synthesized  by  simply  solving 
a  linear  programming  problem  [P4,P25].  A  salient  feature  of  this  framework  is  that  it  explicitly 
allows  for  trading  off  filter  complexity  against  worst-case  estimation  error.  We  have  extended 
these results to a class of switched systems [P18].
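
The report does not spell out the superstability-like condition used in (e), so the following is only a sketch of the classical superstability idea, under assumptions introduced here. If the estimation error obeys e[k+1] = A e[k] + w[k] with every entry of w[k] bounded by eps, and the largest absolute row sum q of A is below one, then the peak error is bounded by eps / (1 - q). Row-sum conditions of this kind are linear in the entries of A, which is consistent with a linear programming synthesis. The matrix A_ERR and all function names are hypothetical.

```python
def inf_norm(matrix):
    """Induced infinity norm: largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in matrix)


def peak_error_bound(a, eps):
    """Bound on max_k ||e[k]||_inf for e[k+1] = A e[k] + w[k], e[0] = 0,
    with ||w[k]||_inf <= eps; valid only when ||A||_inf < 1."""
    q = inf_norm(a)
    if q >= 1:
        raise ValueError("row-sum condition ||A||_inf < 1 is not satisfied")
    return eps / (1 - q)


def simulate_peak(a, eps, steps):
    """Largest ||e[k]||_inf reached when every disturbance entry equals +eps."""
    e = [0.0] * len(a)
    peak = 0.0
    for _ in range(steps):
        e = [sum(aij * ej for aij, ej in zip(row, e)) + eps for row in a]
        peak = max(peak, max(abs(x) for x in e))
    return peak


A_ERR = [[0.5, 0.2], [0.1, 0.3]]  # hypothetical error-dynamics matrix
bound = peak_error_bound(A_ERR, eps=0.1)
peak = simulate_peak(A_ERR, eps=0.1, steps=200)
print(bound, peak)
```

The simulated peak stays below the bound, and lowering the row-sum norm q tightens the worst-case error.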

(f) Robust identification of sparsely interconnected networks. The class of problems that
motivated this research is characterized by complex systems composed of many interacting
agents, each endowed with its own dynamics. In these cases a single lumped model (e.g.
modeling a complex scenario with several adversarial teams using a single “statistical
mechanics” motivated model) is often inadequate for scene analysis and trajectory prediction.
Rather, what is needed is a model that captures both the individual dynamics and the dynamics
of the interaction between agents. As part of this research we have shown that these models can
be obtained by describing the interacting systems as a graph, where each node represents a
homogeneous set of agents (with its own dynamical model), and the links, also dynamical
models, account for the interactions amongst groups. As described in detail in [P23], the problem
of identifying both the graph topology and the individual dynamical models can be reduced to a
convex optimization problem (via group sparsity arguments) and efficiently solved by an
algorithm that only uses local information.


Applications.  The  basic  tools  outlined  above  served  as  key  enablers  to  address  the  following 
practical  problems  arising  in  the  context  of  persistent  surveillance: 


(g) Robust Tracking. We have developed a new class of filters that do not require explicitly
finding a model of the underlying process and have built-in adaptation capabilities. The main
idea is to predict the next position of the target as the one that is maximally compatible with
existing data, in the sense of leading to the minimum order interpolant. In turn, this problem
can be recast into that of minimizing the rank of a suitably constructed Hankel matrix, and
relaxed to a convex optimization using tools similar to those used in compressed sensing. The
effectiveness of this approach is shown in Fig. 3. In addition, in this context model switches
are indicated by a sharp increase in the rank of the Hankel matrix, providing a
computationally efficient way of segmenting high volume temporal data [P3,P7,P11].

Fig. 3. Sustained tracking in the presence of occlusion.
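
The remark in (g) that model switches show up as a jump in Hankel rank can be checked on a toy, noise-free track. This is a minimal sketch under assumptions made here: the one-dimensional track, the window length and the function names are hypothetical, and exact rank on clean data replaces the relaxed, noise-tolerant rank estimates a practical tracker would need.

```python
from fractions import Fraction


def hankel_rank(window, rows):
    """Exact rank of the Hankel matrix H[i][j] = window[i + j]."""
    cols = len(window) - rows + 1
    m = [[Fraction(window[i + j]) for j in range(cols)] for i in range(rows)]
    r = 0
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if m[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, rows):
            factor = m[i][c] / m[r][c]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == rows:
            break
    return r


def window_ranks(seq, length, rows):
    """Hankel rank of every sliding window of the given length."""
    return [hankel_rank(seq[s:s + length], rows)
            for s in range(len(seq) - length + 1)]


# Hypothetical 1D track: constant velocity 1, then constant velocity 2
# (the change happens between samples 5 and 6).
track = [0, 1, 2, 3, 4, 5, 7, 9, 11, 13, 15]
ranks = window_ranks(track, length=5, rows=3)
flagged = [start for start, k in enumerate(ranks) if k > 2]
print(ranks)    # rank 2 inside each constant-velocity segment
print(flagged)  # start indices of windows that straddle the switch
```

Windows lying entirely within one constant-velocity segment have rank 2, while windows that mix both regimes have rank 3, so the flagged windows localize the switch.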

(h)  Data  Integration  from  Multiple  Cameras.  In  order  for  a  multi-camera  tracking  system  to 
take  full  advantage  of  the  additional  information  available  from  its  multiple  sensors,  it  must 
maintain  consistent  identity  labels  of  the 
targets  across  views  and  recover  their  3D 
trajectories.  We  have  developed  a  new 
approach  to  the  problem  of  finding 
correspondences  across  views  that  does  not 
require  feature  matching,  camera  calibration 
or  planar  assumptions.  The  key  idea  is  to 
exploit  the  high  spatio/temporal  correlation 
between  frames  and  across  views  by  (i) 
associating  to  each  viewpoint  a  set  of 
intrinsic  coordinates  on  a  low  dimensional 
manifold  obtained  using  the  identification 
methods  described  in  (a)  above,  and  (ii) 
finding  an  operator  that  maps  the  dynamic 
evolution  of  points  over  manifolds 
corresponding  to  different  viewpoints.  Then,  correspondences  can  be  found  by  simply  running  a 
sequence  of  frames  observed  from  one  view  through  the  operator  to  predict  the  corresponding 
current  frame  in  the  other  view  (Fig.  4)  [P3,P20]. 

(i) Recovering 3D geometry from 2D data. We have developed an efficient algorithm based upon
recasting the problem into a Wiener system identification form. By exploiting dynamical
information, this approach can recover the geometry of the scene up to an overall scaling
constant. For comparison, existing approaches can recover scene information only up to a
(time-varying) projective transformation that does not preserve Euclidean geometry [P6,P10].

(j)  Activity  Recognition.  We  have  shown  that  this  problem  can  be  translated  into  a 
“behavioral”  model  invalidation  form,  where  the  goal  is  to  establish  whether  two  given  time 
series  are  trajectories  (or  “behaviors”)  of  the  same  underlying  dynamical  model.  The  resulting 
problem  can  be  recast  into  a  convex  semidefinite  program  and  efficiently  solved  [P15,P16,P29]. 
Applying  these  ideas  to  the  problem  of  classifying  activities  from  the  challenging  TV 
interactions  database  led  to  a  68%  success  rate,  compared  against  the  best  reported  performance 
in  the  literature  of  54.5%.  In  the  simpler  case  of  single  activities  from  the  KTH  database,  the 
proposed  approach  had  a  93.6%  success  rate,  compared  to  92.1%  achieved  by  existing 
algorithms [P15].


Fig. 5. Recovering the 3D geometry of a scene. Left: sample frame. Right: recovered geometry
(red) superimposed on the ground truth (blue).

Fig. 4. Using dynamic manifold mappings to recreate the appearance of an occluded person.
Panel labels: Occlusion, Left camera view, Right camera view.


(k)  Detecting  Contextually  Abnormal 
Events.  This  problem  fits  naturally  in  the 
framework  developed  in  this  research  by 
associating  activities  to  an  underlying 
dynamical  model.  In  this  context,  a  video 
sequence  does  not  contain  abnormal 
activities if and only if the observed data
corresponds  to  an  admissible  trajectory  of  a 
system  described  by  a  graph,  where  each 
node  corresponds  to  the  dynamical  system 
associated with a normal activity and links correspond to admissible transitions between
activities. Thus, detecting abnormal events reduces to the hybrid model (in)validation problem
discussed in item (c) above. A simple example illustrating these ideas is shown in Fig. 6.
Further details are given in [P21].

Fig. 6: Anomalous behavior detection as a switched (in)validation problem. The top sequence
(walk-wait-walk) is not invalidated since both activities are in the database. The bottom
sequence (walk-jump) is flagged as abnormal since it cannot be generated by switching
amongst models in the database.
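
The walk/wait/jump example of Fig. 6 can be mimicked with a deliberately simplified consistency test. The sketch assumes first-order affine activity models y[t] = a*y[t-1] + b + e[t] with a bounded equation error |e[t]| <= eps, no measurement noise, and arbitrary switching. Under those assumptions a sequence is consistent with the model set exactly when every step is explained by at least one mode. The general problem treated in (c), where noise couples consecutive samples, is harder and is what the semialgebraic formulation addresses. The mode parameters, data and function name below are hypothetical.

```python
def first_unexplained(y, modes, eps):
    """Return the first time index that no mode explains within eps, else None.

    Each mode is a pair (a, b) of the assumed model y[t] = a * y[t-1] + b + e[t],
    with |e[t]| <= eps.
    """
    for t in range(1, len(y)):
        residuals = [abs(y[t] - (a * y[t - 1] + b)) for a, b in modes.values()]
        if min(residuals) > eps:
            return t  # the data invalidate the switched model at this step
    return None


# Hypothetical database of "normal" activities acting on a 1D position:
# walking advances by about one unit per frame, waiting keeps the position.
MODES = {"walk": (1.0, 1.0), "wait": (1.0, 0.0)}

walk_wait_walk = [0.0, 1.02, 1.98, 3.01, 3.03, 2.99, 4.0, 5.03]
walk_jump = [0.0, 1.02, 1.98, 3.01, 6.0, 7.01]

print(first_unexplained(walk_wait_walk, MODES, eps=0.1))  # consistent with the database
print(first_unexplained(walk_jump, MODES, eps=0.1))       # index of the abnormal step
```

The first sequence is explained by switching between the two stored models, while the second contains a step that neither model can produce within the noise bound and is therefore flagged.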


(l) Detecting Coordinated Activities. Causal correlations between individuals can be
detected by reducing the problem to the sparse network identification problem described in (f).
In this context, each node in the graph corresponds to the observed activity of a given agent, and
each link indicates the presence of a causal correlation. It is worth emphasizing that this
approach requires neither previous training nor repetitive activities. An example of application of
these ideas is shown in Fig. 7, where they were used to identify the correlation between agents in
a complex scenario. Further details are provided in [P23].


Fig.  7:  Sample  frames  from  a  doubles  tennis  game  with  identified  causal  connections  superimposed. 